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1.  INTRODUCTION 


The  field  of  molecular  dynamics  is  of  considerable  importance  to  the  molecular  modeling  community. 
The  general  approach  of  molecular  dynamics  is  to  treat  the  motion  of  particles  classically,  rather  than  by 
the  more  rigorous  rules  of  quantum  mechanics.  In  addition,  it  is  common  to  express  the  interaction 
between  the  molecules  or  particles  concerned  in  terms  of  simple  model  functions,  (for  example,  the 
Lennard-Jones  or  Morse  potentials  between  pairs  of  particles),  instead  of  calculating  and  calibrating  a  full 
potential  energy  surface  based  upon  ab  initio  methods. 

These  approximations  are  performed  primarily  for  two  reasons,  both  of  which  are  related  to  the  size 
of  the  system  to  be  treated.  The  first  is  for  computational  simplicity:  the  largest  chemically  reactive 
systems  treated  to  date  by  rigorous  quantum  mechanical  methods  consist  of  only  three  atoms,  for  example 
the  reactions 

H  +  H2 - >  H2  +  H 

or 

F  +  H2 - >  HF  +  H  , 

each  of  which  required  hundreds  of  hours  of  Cray  supercomputer  time  despite  their  apparent  simplicity. 
Although  the  accuracy  of  the  method  is  limited  by  the  classical  approximation,  molecular  dynamics 
calculations  can  reasonably  simulate  systems  containing  hundreds  or  even  thousands  of  atoms.  Rigorous 
quantum  mechanical  treatment  of  systems  of  this  size  is  currently  beyond  the  state  of  the  art. 

The  second  reason  is  paradoxically  an  advantage  related  to  the  size  of  the  system.  In  most  molecular 
dynamics  simulations,  what  is  sought  is  not  detailed  information  concerning  the  internal  state  of  each 
molecule  involved,  but  bulk  properties,  such  as  temperature  or  diffusion  coefficients.  These  properties 
are  best  calculated  by  the  treatment  of  a  large  number  of  molecules  simultaneously,  where  averaged 
behavior  is  of  interest.  This  averaging  is  usually  performed  by  making  the  ergodic  assumption,  i.e.,  that 
the  average  value  of  a  quantity  over  time  is  the  same  as  the  average  value  of  that  quantity  over  a  large 
number  of  particles.  Despite  the  simplifying  approximations  described  previously,  molecular  dynamics 
calculations  remain  formidable  computational  tasks,  due  to  the  large  number  of  particles  involved,  and  the 
amount  of  time  for  which  the  molecular  motions  must  be  followed.  This  paper  presents  the  results  of 
several  optimization  steps  for  a  prototype  molecular  dynamics  program  on  two  IBM  RS/6000  scalar  work 
stations  (a  Model  370  and  a  Model  550)  and  several  Cray  vector  supercomputers  (a  Cray  X-MP/48, 
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a  Cray  Y-MP8/128,  and  a  Cray  C-90/16256).  The  model  system  is  first  described,  including  the  periodic 
boundary  conditions,  the  interaction  potential,  and  the  method  of  tracking  each  particle.  The  FORTRAN 
implementation  of  this  model  is  described,  and  various  optimization  schemes  are  applied  to  the  program 
on  each  machine.  The  efficiency  of  each  optimization  scheme  is  compared,  and  the  application  of  these 
schemes  to  more  general  programs  is  considered. 

2.  THEORY  AND  MODEL 

2.1  System.  The  system  modeled  by  the  present  program  is  a  two-dimensional  (2-D)  herringbone 
lattice  of  diatomic  molecules  which  are  to  be  perturbed  by  a  shock  wave  passing  from  one  end  of  the 
lattice  to  the  other.  The  time  history  of  the  system  is  followed  by  integration  of  Hamilton’s  equations  of 
motion  using  a  modified  variable  step-size  Adams-Moulton  fourth-order  predictor-corrector  integrator 
(Miller  and  George  1972),  with  relative  error  tolerance  set  at  10~^.  The  simulation  is  terminated  by  a 
user-defined  cutoff  on  the  number  of  integration  steps  allowed. 

The  system  used  for  the  purposes  of  this  study  consists  of  192  atoms,  all  of  which  are  allowed  to 
move  in  the  simulations.  The  atoms  are  laid  out  in  a  rectangular  grid  of  12  rows  and  18  columns,  as 
shown  in  Figure  1. 

2.2  Periodic  Boundary  Conditions.  The  periodic  boundary  conditions  used  in  this  model  wiU  be 
described  first,  since  they  are  intrinsic  to  both  the  tracking  and  identification  of  particles  (see  2.3)  and  the 
interaction  potential  (see  2.4).  The  program  constructs  a  2-D  rectangular  lattice  of  dimension  M  x  N, 
where  M  and  N  are  user-supplied  parameters  read  in  by  the  program.  The  physical  system  (~10^^  atoms) 
to  be  modeled  is  much  larger  than  the  number  of  particles  that  may  be  simulated  (-10'*).  We  are 
interested  in  the  properties  of  a  bulk  crystal  undergoing  shock,  rather  than  a  tiny  crystal  with  large  surface 
area.  For  a  system  of  lO'*  atoms  there  tends  to  be  a  substantial  number  of  atoms  in  the  simulation  box 
which  are  in  close  proximity  to  the  edges  of  the  box.  These  atoms  behave  as  "surface"  atoms  rather  than 
as  atoms  in  the  bulk.  Of  course,  this  would  dramatically  affect  the  computed  final  results,  and  would  not 
give  a  physically  meaningful  description  of  bulk  properties. 

When  periodic  boundary  conditions  are  enforced,  a  particle  exiting  the  simulation  box  at  one  edge  is 
replaced  by  another  particle  entering  the  simulation  box  at  the  opposite  edge,  with  identical  mass  and 
velocity  vector  of  equal  magnitude  and  direction.  (To  visualize  this  better,  consider  a  sheet  of  paper  rolled 
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Figure  1.  Test  system  for  benchmaric  studies. 
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into  a  cylinder.  A  pencil  line  drawn  off  of  the  top  of  the  page  immediately  crosses  over  to  the  bottom 
of  the  page  with  no  discontinuity.  If  the  periodic  boundary  conditions  are  applied  in  both  the  horizontal 
and  the  vertical  directions,  the  cylinder  becomes  a  torus.)  The  periodic  boundary  condition  method 
effectively  allows  the  description  of  an  infinite  system  by  a  finite  number  of  particles.  The  method  is 
described  in  more  detail  in  Allen  and  Tildesley  (1990).  For  the  current  simulation,  periodic  boundary 
conditions  are  enforced  in  both  directions. 

2.3.  Tracking  and  Identification  of  Particles.  The  technique  used  for  the  tracking  and  identification 
of  particles  during  the  simulation  is  the  method  of  cells  and  linked  lists,  as  described  in  Allen  and 
Tildesley  (1990).  However,  this  method  has  been  modified  for  our  purposes,  since  the  current  system  is 
only  2-D.  The  rectangular  simulation  box  is  divided  into  a  rectangular  lattice  of  x  Ly  cells,  as  shown 
in  Figure  2,  where  the  origin  of  the  system  is  the  lower  left-hand  comer  of  cell  1.  The  value  of 
(q  =  X,  y)  is  chosen  so  that  the  length  of  each  cell  in  each  dimension,  d^,  is  larger  than  the  cutoff  radius  r 
of  the  interaction  potential  (see  2.4).  Also,  each  cell  must  be  a  multiple  of  the  unit  cell  of  the  crystal. 
For  the  system  in  this  study,  the  cell  sizes  are  twice  the  size  of  the  unit  cell  in  both  the  x  and  y  directions, 
making  =  4  and  Ly  =  3. 

As  an  example,  consider  cell  6  in  Figure  2.  The  only  particles  that  may  be  considered  as  neighboring 
particles  of  those  in  ceU  6  are  those  in  cells  1, 2,  3,  5, 7,  9,  10,  and  11.  All  particles  in  aU  other  cells  are 
outside  the  range  of  the  interaction  potential.  As  an  example  of  an  edge  cell,  consider  cell  1.  The  only 
particles  that  are  considered  neighbors  of  those  in  cell  1  are  those  in  cells  12,  9,  10,  4,  2,  8,  5,  and  6. 
Since  each  of  these  cells  has  a  unique  list  of  the  particles  it  contains,  searching  through  the  list  of  particles 
in  the  neighboring  cells  is  a  much  more  rapid  process  than  testing  every  particle  in  the  entire  box. 

2.4  Interaction  Potential.  The  interaction  potential  describing  this  system  was  developed  by  Brenner 
(1991).  It  is  a  pair-additive  function,  which  incorporates  a  many-body  effect  to  the  bonding  environment 
between  two  atoms.  Its  functional  form  is: 
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Parameters  for  equations  1-6  are  given  later  in  the  report  in  Table  1.  We  wish  to  point  out  that  there  are 
several  changes  in  this  potential  fiom  that  given  by  Brenner  (1991).  First  of  all,  the  equation  reproduced 
in  equation  4  here  was  written  incorrectly  in  Brenner  (1991).  The  correct  form  is  in  equation  4.  Values 
of  the  P’s  were  given  in  White  (1992);  however,  using  these  parameters  led  to  a  discontinuity  in  the 
potential.  We  have  therefore  chosen  the  P’s  in  Table  1  such  that  this  polynomial  has  continuous  first 
derivatives.  Also,  there  were  two  discontinuities  in  Vj,jg,  which  we  have  corrected  by  adding  quintic 
splines.  The  coefficients  of  the  quintic  splines  were  chosen  such  that  the  first  and  second  derivatives  of 
this  function  were  continuous. 


3.  COMPUTER  SYSTEMS 


The  computer  systems  on  which  the  benchmarking  studies  have  been  performed  are  a  Cray  Research, 
Inc.  Cray  X-MP/48,  with  operating  system  UNICOS  version  6.1.3,  a  Cray  Research,  Inc.  Cray  Y-MP8/128, 
with  operating  system  UNICOS  version  7.0.4.4,  a  Cray  C-90/16256,  with  operating  system  7.C.3 
prerelease,  and  two  IBM  RS/6()00  RISC  woikstations  (a  Model  550  and  a  Model  370,  each  with  operating 
system  AIX  version  3.2).  The  Crays  are  conventional  vector  supercomputers.  For  the  Crays,  numerical 
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calculations  (such  as  FORTRAN  DO  LOOPs)  may  be  partially  parallelized  by  the  vector  registers,  and 
other  FORTRAN  constructs  are  handled  sequentially.  The  IBM  work  stations  are  scalar  processors  with 
reduced  instruction  set  chip  (RISC)  architectures  that  may  handle  certain  FORTRAN  tasks  very  efficiently 
due  to  specific  low-level  commands  within  the  reduced  instruction  set. 

The  next  section  of  this  report  relates  the  performance  of  this  prototype  molecular  dynamics  code  on 
each  of  the  machines,  and  compares  the  results  on  each  machine  when  analogous  compiler  optimizations 
and  FORTRAN  constructs  are  included  in  the  program  on  each  machine.  Although  there  are  differences 
between  the  Cray  X-MP  architecture  and  that  of  the  Y-MP  and  C-90  (e.g.,  the  length  of  the  vector 
registers  on  the  C-90  is  256  compared  to  64  on  the  X-MP),  these  machines  are  similar  enough  that  the 
discussion  of  the  X-MP  vector  architecture  is  applicable  to  the  Y-MP  and  C-90  as  well. 

4.  OPTIMIZATION 

This  section  details  the  performance  of  the  computer  program  on  each  of  the  machines  listed  in 
section  3,  together  with  a  description  of  accuracy  tests  on  the  output,  which  are  discussed  in  more  detail 
later.  Section  4.1  describes  the  unoptimized  version  of  the  program  and  the  checks  on  accuracy  of  the 
final  results.  Section  4.2  describes  changes  to  important  FORTRAN  kernels  which  speed  execution  on 
some  of  the  machines.  Section  4.3  describes  the  influence  of  various  compiler  options  on  the  scalar  and 
vector  performance  of  the  program. 

4.1.  Unoptimized  Program  and  Accuracy  Tests.  This  section  describes  the  perfoimance  of  the  raw, 
unoptimized  FORTRAN  on  each  of  the  four  platforms  of  section  3.  With  only  two  exceptions,  the  default 
compiler  options  were  used  in  each  case.  The  first  exception  is  common  to  all  of  the  machines  tested, 
and  is  the  use  of  a  compiler  flag  to  detect  and  warn  of  all  non-ANSI  standard  FORTRAN  77  within  the 
program.  This  was  done  to  ensure  that  the  same  program  was  run  on  all  of  the  machines.  The  second 
exception  was  for  the  Cray  computers.  Most  computers  represent  real  floating-point  numbers  as  a  32-bit 
word,  which  is  referred  to  as  "single  precision"  accuracy.  This  representation  allows  four  figures  after  the 
decimal,  which  is  not  sufficiently  accurate  for  most  scientific  computations.  To  overcome  this  difficulty, 
most  scientific  computer  programs  are  written  in  "double  precision"  accuracy  (64-bit  words),  which  allows 
eight  figures  after  the  decimal  point.  However,  the  Cray  computer  architecture  was  designed  from  the  start 
primarily  for  scientific  applications,  and  so  the  Cray  architecture  allows  for  64-bit  words  already  within 
single  precision.  To  ensure  compatibility  of  the  same  FORTRAN  program  on  all  of  the  machines,  the 
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compiler  option  "-dp"  was  used  on  the  Cray.  This  option  tells  the  Cray  to  treat  all  variables  labeled  as 
double  precision  as  though  they  were  Cray  single  precision,  so  that  the  same  64-bit  word  length  is  used 
on  all  machines  in  the  benchmark  tests. 

There  are  two  tests  which  are  used  to  ensure  the  accuracy  of  the  results  on  all  five  of  the  machines. 
These  tests  are  used  to  make  certain  that  the  program  changes  and  compiler  options  introduced  later  do 
not  lead  to  errors  in  the  program.  The  first  test  is  that  of  conservation  of  energy.  That  is,  the  total  energy 
of  the  system  may  be  described  at  each  step  by 

E  =  V  -H  T  (7) 

where  V  is  the  potential  energy  described  in  equation  1,  and  T  is  the  kinetic  energy  of  the  system.  The 
possibility  of  error  in  energy  conservation  arises  within  the  numerical  integration  step,  when  the  solution 
to  Hamilton’s  equation  of  motion  for  the  system  is  propagated  numerically.  The  error  of  the 
Adams-Moulton  integration  scheme  is  on  the  order  of  h^  where  h  is  the  time  step  used  in  the  integration. 
If  too  large  of  a  time  step  is  used,  then  numerical  errors  will  appear  in  the  predicted  values  of  the 
momenta  of  the  particles,  and  these  will  intensify  as  the  propagation  is  continued.  Conversely,  if  too  small 
of  a  time  step  is  used,  then  round-off  errors  in  the  numerical  quantities  will  accumulate  and  affect  the 
accuracy  of  the  results.  For  the  simulation  described  here,  a  maximum  tolerance  of  0.1%  is  allowed  for 
variation  of  the  total  energy  of  the  system  at  each  step.  The  second  check  on  the  final  results  is  given 
in  detail  in  Tsai  (1979).  Briefly,  this  check  involves  two  independent  methods  of  computing  the  average 
pressure  of  the  system  over  the  course  of  the  simulation.  The  first  of  these  is  to  compute  the  pressure 
from  the  internal  stress  of  the  system: 


P  =  J  (Ox*  +  Oyy)  (8) 

where  and  Oyy  are  the  internal  normal  components  of  the  stress  of  the  system.  We  note  that  we 
calculated  and  Cyy  at  imaginary  dividing  surfaces  located  on  the  edge  of  every  unit  cell  of  this  crystal, 
and  used  the  time-averaged  value  of  these  components  in  equation  8.  The  second  of  these  is  to  calculate 
the  pressure  using  the  virial  theorem  (Tsai  1979).  Because  the  internal  pressure  is  being  calculated  for 
a  system  at  thermal  equilibrium,  it  is  expected  that  the  virial  theorem  is  valid,  and  that  the  two  methods 
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for  calculating  the  pressure  of  this  system  should  be  comparable.  Variance  of  approximately  1  %  in  the 
average  pressure  calculated  by  the  two  methods  may  be  expected  (Tsai  1979). 

Initially,  all  of  the  atoms  are  in  the  equilibrium  position;  the  potential  energy  of  the  system  is  zero. 
The  total  energy  of  each  atom  is  taken  to  be  kT,  where  k  is  the  Boltzmann  constant  and  T  is  set  to 
298.15  K.  The  energy  of  each  atom  is  equipartitioned  into  the  X  and  Y  momenta  components,  with  the 
sign  of  each  component  assigned  randomly.  The  equations  of  motion  for  the  system  are  integrated  for 
2,0(X)  steps  (~0.08  ps)  to  "randomize"  the  energy  of  the  system.  From  this  point  through  the  next  10,(XX) 
integration  steps,  the  internal  pressure,  virial,  and  energy  were  calculated  and  stored  for  averaging  at  each 
tenth  step  of  the  simulation.  At  the  end  of  the  total  12,0(X)  integration  steps,  the  simulation  was  stopped. 
The  total  simulation  time  period  was  approximately  0.4  ps. 

The  next  section  of  this  report  discusses  simple  changes  that  may  be  made  to  the  FORTRAN,  using 
widely  supported  and  publicly  available  subprograms  that  help  to  improve  the  floating-point  performance 
of  the  program. 

4,2  Code  Optimizations.  As  a  first  step  towards  optimization  of  the  programs,  it  is  necessary  to 
understand  the  differences  between  the  Cray  computer  and  the  work  stations.  The  Cray  machine  achieves 
its  high  floating-point  performance  through  two  factors:  its  high  clock  speed  and  its  architecture.  The 
clock  speed  of  the  machine  is  the  maximum  rate  at  which  instructions  can  be  sent  from  the  CPU.  For  the 
Cray  XMP/48  used  here  the  clock  speed  is  8.5  ns,  or  approximately  120  MHz.  The  clock  speed  of  the 
Y-MP/8128  is  6.0  ns  (~165  Mhz),  and  the  clock  speed  of  the  C-90/16256  is  4.2  ns  (~240  Mhz).  By 
comparison,  the  clock  speed  of  the  IBM  Model  550  is  approximately  42  MHz,  and  the  clock  speed  of  the 
IBM  Model  370  is  approximately  62  MHz).  However,  due  to  the  fact  that  several  instructions  must  be 
issued  for  each  floating-point  operation  that  is  performed,  the  clock  speed  is  greater  than  the  millions  of 
floating-point  operations  per  second  (MFLOPs)  rate.  Furthermore,  the  exact  number  of  instructions  which 
must  be  issued  for  each  floating-point  operation  differs  from  machine  to  machine  and  for  different 
floating-point  operations  (e.g.,  computation  of  a  SINE  function  is  much  more  costly  than  a  floating-point 
addition)  so  that  an  exact  relation  of  clock  speed  to  program  execution  rate  is  impractical.  Therefore  in 
the  discussion  that  follows  we  concentrate  upon  the  architecture  of  the  Grays  as  compared  to  the  IBM 
work  stations,  and  the  influence  of  these  architectural  differences  between  the  machines  upon  CPU 
performance. 


9 


The  Cray  machines  are  vector  pipeline  computers.  This  means  that  if  there  exists  a  floating-point 
operation  that  must  be  performed  identically  on  some  subset  of  the  data  within  memory,  and  that  the  data 
is  contiguous  or  each  element  of  the  data  is  separated  by  a  constant  interval,  then  the  Grays  can  read  the 
data  to  be  manipulated  from  memory  into  the  vector  registers,  perform  the  pipelined  floating-point 
operations  on  the  data,  and  write  the  result  back  to  memory.  A  typical  example  of  such  a  situation  in 
FORTRAN  is  the  DO  LOOP,  for  example: 


DIMENSION  A(IOOO) 
DO  10  1=1,1000 
A(I)  =  A®  *  0.023 
10  CONTINUE 


where  A  is  the  name  of  the  array  on  which  the  identical  operations  are  to  be  performed;  A  is  stored 
contiguously,  and  each  element  of  A  is  to  be  multiplied  by  a  constant.  On  a  Cray  machine,  the 
multiplication  of  elements  of  A  would  be  performed  in  parallel  (vectorized),  where  the  number  of  elements 
of  A  multiplied  at  one  time  are  dependent  upon  the  length  of  the  Cray's  vector  register  (this  number  ranges 
from  64  for  the  Cray  X-MP/48  to  256  on  the  Cray  C-90/16256).  Once  the  vector  register  is  loaded, 
floating-point  operations  are  produced  at  a  rate  of  one  every  clock  cycle.  There  exists  some  overhead  in 
setting  up  the  vectotization  of  a  DO  LOOP,  so  that  the  time  saved  by  the  vectorization  may  not  always 
be  sufficient  to  make  up  for  the  time  lost  in  setting  up  the  vectorization.  Therefore,  vectorization  of  small 
DO  LOOPS  may  actually  slow  down  the  execution  of  a  program. 

It  should  also  be  noted  that  some  FORTRAN  constructs  will  prevent  vectorization.  For  example,  the 
following  DO  LOOP  would  not  be  vectorized: 


DIMENSION  A(IOOO) 
DO  10  1=1,1000 
IF(I.LT.200)THEN 
A(I)  =  A(I)  *  0.023 
ELSE 

A(I)  =  A®  +  6.152 
END  IF 

10  CONTINUE 


because  different  elements  of  the  vector  A  would  have  different  arithmetical  operations  performed  on 
them.  The  following  loop  would  also  be  vectorized: 
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DIMENSION  A(IOOO),  B(IOOO) 

DO  20  1=1,  1000 
A®  =  A(I)  +  B(I)  *  12 

20  CONTINUE 

In  the  latter  case,  once  the  vector  registers  have  been  loaded,the  results  of  the  multiplication  are  "chained" 
to  the  vector  register  to  perform  the  addition  at  the  same  time,  so  that  the  net  result  is  two  floating-point 
operations  every  clock  cycle. 

On  scalar  work  stations,  there  are  no  vector  registers,  but  the  rate  of  execution  of  a  FORTRAN  kernel 
that  would  vectorize  on  the  Cray  is  limited  by  the  rate  at  which  data  may  be  transferred  from  memory  to 
the  floating-point  processor.  On  the  IBM  woik  stations,  this  is  controlled  by  a  local  cache  which  transfers 
data  from  main  memory  to  the  processor.  If  an  instruction  requires  data  not  in  the  cache,  then  a  page  fault 
will  occur,  which  delays  execution  until  the  proper  data  can  be  accessed  from  memory.  Explicitly  writing 
a  FORTRAN  code  which  is  guaranteed  to  avoid  such  delays  is  very  time-consuming  and  produces 
complicated  DO  LOOPs.  Fortunately,  this  difficulty  is  easily  circumvented  as  described  next. 

A  set  of  portable  routines,  the  BLAS  (Basic  Linear  Algebra  Subprograms),  which  are  available  on  a 
wide  variety  of  machines,  including  the  Cray  and  the  IBM  work  stations,  were  designed  to  help 
circumvent  these  problems.  (The  two  DO  LOOPs  in  the  previous  examples  are  reproducible  u^ng  the 
BLAS  subprograms  DSCAL  and  DAXPY,  respectively.)  The  BLAS  routines  provide  a  standard  calling 
sequence  for  a  number  of  simple  FORTRAN  kernels  which  occur  repeatedly  in  scientific  applications,  and 
have  been  implemented  as  machine-optimized  library  routines  by  a  number  of  hardware  vendors. 
Furthermore,  the  calling  sequences  and  descriptions  of  the  BLAS  are  publicly  available  free  of  charge  via 
electronic  mail  (netlib@oml.gov).  Since  the  BLAS  routines  are  machine  optimized,  and  are  widely 
portable,  we  decided  to  modify  the  existing  molecular  dynamics  prototype  program  to  include  them  on 
the  scalar  woik  stations.  We  did  not  include  these  routines  in  the  Cray  version  of  the  program  for  two 
reasons.  First,  all  FORTRAN  kernels  within  the  present  program  which  are  eligible  for  replacement  by 
the  BLAS  are  already  vectorized  by  the  Cray's  compiler  and  so  perform  efficiently.  Furthermore,  the 
on-line  documentation  supplied  by  Cray  Research,  Inc.  on  the  Cray  machines  indicates  that  for  many  cases 
(including  those  situations  considered  in  the  present  program),  the  performance  of  the  program  created  by 
the  Cray  vectorizing  compiler  exceeds  that  of  the  program  the  BLAS  (even  the  Cray  Research,  Inc. 
scientific  subroutine  library  version  of  the  BLAS). 
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We  now  give  a  description  of  those  routines  within  the  molecular  dynamics  code  in  which  the  BLAS 
routines  were  incorporated,  and  illustrate  the  application  of  the  BLAS  routines  to  FORTRAN  constructs 
for  which  several  alternative  implementations  were  possible.  This  is  done  progressively  in  order  that  the 
reader  may  leam  to  apply  the  BLAS  to  his/her  own  application  in  the  most  efficient  way  possible.  The 
four  subroutines  in  the  program  in  which  the  BLAS  routines  were  included  are  as  follows: 

PROGRAM  DRIVER:  the  main  driver  for  the  program 
SUBROUTINE  POTEN:  calculates  the  potential  and  derivatives 
SUBROUTINE  VAM:  Adams-Moulton  numerical  integrator 

SUBROUTINE  VPRIME:  calculates  time  derivatives  of  coordinates  and  momenta  for 

SUBROUTINE  VAM 


Not  all  of  these  programs  are  called  the  same  number  of  times,  and  so  not  all  of  these  routines  require 
an  equal  amount  of  CPU  time  for  execution.  For  the  purposes  of  this  discussion,  we  concentrate  on  the 
two  routines  VAM  and  VPRIME,  since  these  two  routines  contain  aU  of  the  different  types  of  FORTRAN 
DO  LOOPS  which  were  replaced  by  the  BLAS  in  the  other  two  routines. 


SUBROUTINE  VPRIME  consists  of  the  DO  LOOP 


DO  10  I  =  1,  NMOVE 
XAU)  =  xa) 

YA®  =  Y(I) 

10  CONTINUE 


followed  by  a  caU  to  POTEN.  This  DO  LOOP  is  trivially  replaced  by  the  simplest  of  the  BLAS  routines, 
DCOPY  (the  complete  calling  sequence,  including  a  description  of  each  argument,  is  given  for  this  and 
for  the  other  BLAS  shown  next): 

CALL  DCOPY  (NMOVE,  X(l),  1,  XA(1).  1) 

CALL  DCOPY  (NMOVE,  Y(l),  1,  YA(1).  1) 

The  call  to  POTEN  is,  of  course,  not  changed. 
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SUBROUTINE  VAM  contains  the  following  DO  LOOPs  (in  addition  to  other  code): 


DO  100  J  =  5.  14 
DO  100  I  =  1,  NEQ 
100  T(J,I)  =  ZERO 

DO  3001=  l.NEQ 
T(3,I)  =  Y(I) 

T(9,I)  =  DY(I) 

IF  (NORDER  .GT.  0)  THEN 
DEN  =  ZERO 
DO  250  J  =  1,  NORDER 
DEN  =  DEN  +  H(J) 

T(J+9,I)  =  (T(J+8,I)-T(J+3,I))/DEN 
T(J+3,I)  =  T(J+8.I) 

250  CONTINUE 
END  IF 

T(N0RDER44,I)  =  T(NORDER+9.I) 

300  CONTINUE 

DO  500  I  =  1,  NEQ 
TEMP  =  0.00 

IF  (NORDER  .GT.  0)  THEN 
DO  450  J  =  1,  NORDER 
TEMP  =  TEMP  +  T(4+J,I)*AA(J) 

450  CONTINUE 
END  IF 

T(1,I)  =  TEMP  +  T(4,I)*H(1) 

Y(I)  =  T(1,I)  +  T(14.I)  +  T(3,I) 

500  CONTINUE 

DO  600  I  =  1,  NEQ 
T(9,I)  =  DY(I) 

DEN  =  ZERO 

IF  (NORDER  .GT.  0)  THEN 
DO  550  J  =  1,  NORDER 
DEN  =  DEN  +  H(J) 

T(J+9,I)  =  (T(J+8,I)  -  T(J+3,I))/DEN 
550  CONTINUE 
END  IF 

600  CONTINUE 
DO  700  I  =  1,  NEQ 
TEMP  =  0.00 

IF  (NORDER  .GT.  0)  THEN 
DO  650  J  =  1,  NORDER 
TEMP  =  TEMP  +  T(9+J,I)*BB(J) 

650  CONTINUE 
END  IF 
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T(2,I)  =  T(9,I)*H(1)  -  TEMP 
Y(I)  =  T(2.I)  +  T(14,I)  +  T(3,I) 

T(14.I)  =  Cr(2,I)  +  T(14,I))  -  (Y(I)  -  T(3.I)) 
700  CONTINUE 


While  these  DO  LOOPS  may  seem  complicated,  it  will  be  shown  that  they  all  may  be  rewritten  in  terms 
of  four  BLAS  routines:  DAXPY,  DSCAL,  DDOT,  and  DCOPY.  For  reference,  we  give  the  function  and 
a  generic  argument  list  for  each  of  these  routines. 

DAXPY:  multiply  the  elements  of  one  vector  X  by  a  constant  CON,  and  add  the  results  to  another 
vector  Y. 

CALL  DAXPY(NN.  CON,  X(lst),  DX,  Y(lst),  DY) 


where 


NN  is  the  number  of  elements  of  X  and  the  number  of  elements  of  Y  to  be  accessed. 
CON  is  the  constant  by  which  to  multiply  the  elements  of  X. 

X(lst)  is  the  first  element  of  vector  X  to  be  accessed. 

DX  is  the  interval  between  successively  accessed  elements  of  X. 

Y(lst)  is  the  first  element  of  vector  Y  to  be  accessed. 

DY  is  the  interval  between  successively  accessed  elements  of  Y. 

DSCAL:  multiply  the  elements  of  a  vector  by  a  constant. 

CALL  DSCAL(NN,  CON,  X(lst),  DX) 

where 

NN  is  the  number  of  elements  of  X  to  be  accessed. 

CON  is  the  constant  by  which  to  multiply  the  elements  of  X. 

X(lst)  is  the  first  element  of  vector  X  to  be  accessed. 

DX  is  the  interval  between  successively  accessed  elements  of  X. 

DDOT:  take  the  dot  product  of  two  vectors. 

CONSTANT  =  DDOT(NN,  X(lst),  DX,  Y(lst),  DY) 

where 

CONSTANT  will  contain  the  final  value  computed  by  FUNCTION  DDOT. 

NN  is  the  number  of  elements  of  X  and  the  number  of  elements  of  Y  to  be  accessed. 
X(lst)  is  the  first  element  of  vector  X  to  be  accessed. 
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DX  is  the  interval  between  successively  accessed  elements  of  X. 

Y(lst)  is  the  first  element  of  vector  Y  to  be  accessed. 

DY  is  the  interval  between  successively  accessed  elements  of  Y. 

DCOPY:  copy  the  contents  of  one  vector  to  another  vector. 

CALL  DCOPY(NN,  X(lst).  DX,  Y(lst),  DY) 

where 

NN  is  the  number  of  elements  of  X  and  the  number  of  elements  of  Y  to  be  accessed. 
X(lst)  is  the  first  element  of  vector  X  to  be  accessed. 

DX  is  the  interval  between  successively  accessed  elements  of  X. 

Y(  1st)  is  the  first  element  of  vector  Y  to  be  accessed. 

DY  is  the  interval  between  successively  accessed  elements  of  Y. 


Inspection  of  the  DO  LOOPS  shown  previously  shows  that  the  sequence  DO  250/DO  300  has  the 
same  structure  as  the  sequence  DO  550/DO  600,  so  that  when  one  of  these  computations  has  been 
rewritten,  then  the  work  for  rewriting  both  of  them  has  been  done.  Furthermore,  the  structure  of  the 
sequence  DO  450/DO  500  has  the  same  structure  as  the  sequence  DO  650/DO  700,  so  that  this  pair  of 
computations  may  also  be  rewritten  in  tandem.  We  now  consider  each  of  the  DO  LOOPs  in  turn. 

The  first,  DO  LOOP  100,  is  a  simple  double-nested  structure: 

DO  100  J  =  5,  14 
DO  100  I  =  1,  NEQ 
100  T(J,I)  =  ZERO 


Since  this  is  a  doubly  nested  DO  LOOP,  there  is  no  BLAS  routine  which  will  perform  the  work  on 
both  indices  simultaneously.  However,  it  is  possible  to  replace  one  of  the  DO  LOOPs  with  the  DSCAL 
command,  while  retaining  the  other.  In  general,  it  is  more  efficient  to  structure  the  BLAS  so  that  each 
call  references  the  greatest  number  of  elements  of  the  arrays  concerned.  In  this  case,  the  innermost  DO 
LOOP  is  rewritten; 


DO  100  J=5,14 

CALL  DSCAL(NEQ,ZERO,T(J,l),NDIM) 
100  CONTINUE 
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where  DSCAL  is  the  BLAS  routine  which  multiplies  the  selected  elements  of  an  array  by  a  constant. 
Since  NDIM  is  the  row  dimension  of  array  T,  this  operation  overwrites  all  the  elements  of  the  Jth  row 
of  array  T  with  ZERO  at  each  iteration  J. 

The  second  DO  LOOP  is  far  more  complex: 


DO  300  I  =  1,  NEQ 
T(3,I)  =  Y(I) 

T(9,I)  =  DY(I) 

IF  (NORDER  .GT.  0)  THEN 
DEN  =  ZERO 
DO  250  J  =  1,  NORDER 
DEN  =  DEN  +  H(J) 

T(J+9,I)  =  (T(J+8.I)-T(J+3,I))/DEN 
T(J+3,I)  =  T(J+8,I) 

250  CONTINUE 
END  IF 

T(NORDER+4,I)  =  T(NORDER+9.I) 
300  CONTINUE 


This  DO  LOOP  contains  several  array  copy  commands,  a  partial  reduction,  and  the  scaling  of  an  array 
by  a  constant  value.  The  structure  is  further  complicated  by  the  presence  of  an  IF...THEN  block. 


The  first  thing  to  do  is  to  see  if  any  of  the  LOOP  can  be  simplified.  Inspection  shows  that  the  first, 
second,  and  final  commands  are  done  irrespective  of  the  IF..  .THEN  block,  and  furthermore  do  not  access 
any  array  values  which  may  be  modified  within  the  IF...THEN  block.  These  may  therefore  be  removed 
immediately  from  the  nested  DO  LOOP  structure  and  replaced  by  BLAS  DCOPY  calls: 


CALL  DCOPY(NEQ,Y(l),l,T(3,l),NDIM) 

CALL  DC0PY(NEQ,DY(1),1,T(9,1),ND1M) 

DO  3001=  1,  NEQ 
IF  (NORDER  .GT.  0)  THEN 
DEN  =  ZERO 
DO  250  J  =  1,  NORDER 
DEN  =  DEN  +  H(J) 

T(J+9,I)  =  (T(J+8,I)-T(J+3,I))/DEN 
T(J+3,I)  =  T(J+8,I) 

250  CONTINUE 
END  IF 

300  CONTINUE 

CALLDC0PY(NEQ,T(N0RDER+9,1),NDIM,T(N0RDER44,1),NDIM) 
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(Note  also  that  even  though  the  final  DCOPY  command  involves  the  array  T(NORDER+9,l),  which  is 
affected  by  the  IF...THEN  block,  that  array  is  only  used  after  the  IF...THEN  block  has  been  completed, 
and  so  it  is  permissible  to  remove  the  array  from  the  loop  by  using  the  DCOPY  command.) 


Once  this  step  has  been  performed,  it  is  easily  seen  that  the  IF,..THEN  block  encompasses  all  of  the 
numerical  computation  within  both  DO  LOOPS,  so  that  all  of  the  numerical  work  within  the  DO  LOOPs 
is  determined  by  the  value  of  NORDER,  which  is  defined  outside  of  the  DO  LOOPs.  It  is  therefore  more 
efficient  to  move  the  IF...THEN  block  outside  of  the  DO  LOOPs  so  that  the  LOGICAL  IF  need  only  be 
evaluated  once  for  this  set  of  DO  LOOPs,  instead  of  NEQ  times: 


CALL  DCOPY(NEQ,Y(l),l,T(3,l),NDIM) 

CALL  DCOPY(NEQ,DY(l),l ,T(9, 1),NDIM) 

IF  (NORDER  .GT.  0)  THEN 
DO  300  I  =  1,  NEQ 
DEN  =  ZERO 
DO  250  J  =  1,  NORDER 
DEN  =  DEN  +  H(J) 

T(J+9,I)  =  (T(J+8,I)-T(J+3,I))/DEN 
T(J+3,I)  =  T(J+8,I) 

250  CONTINUE 
300  CONTINUE 
END  IF 

CALLDCOPY(NEQ,T(NORDER+9,l),NDIM,T(NORDER+4,l),NDIM) 


When  the  DO  LOOP  structure  has  been  reconfigured  in  this  form,  then  the  parallelism  within  the  inner 
loop  becomes  apparent.  It  is  tempting  to  immediately  rewrite  the  loop  over  250,  by  replacing  the 
recursion  over  J  in  tenns  of  a  BLAS  DCOPY  and  DAXPY  for  the  first  line  and  a  DCOPY  in  the  second 
line.  But  this  is  impossible,  because  the  value  DEN  in  the  first  line  changes  with  J,  and  so  the  first  line 
cannot  be  written  as  a  DAXPY  over  J.  (Recall  that  the  DAXPY  command  scales  the  elements  of  one 
matrix  by  a  constant  and  adds  the  results  to  a  second  matrix.)  However,  it  is  still  possible  to  write  the 
woiic  in  terms  of  the  BLAS  routines,  if  one  first  reverses  the  order  in  which  the  DO  LOOPs  are  executed: 


CALL  DCOPY(NEQ,Y(l),l,T(3,l),NDIM) 
CALLDCOPY(NEQ,DY(l),l,T(9,l),NDIM) 
IF  (NORDER  .GT.  0)  THEN 
DEN  =  ZERO 
DO  250  J  =  1,  NORDER 
DEN  =  DEN  +  H(J) 

DO  3001=  1,NEQ 
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T(J+9,I)  =  (T(J+8,I)-T(J+3,I))/DEN 
T(J+3,I)  =  T(J+8,I) 

300  CONTINUE 
250  CONTINUE 
END  IF 

CALL  DCOPY(NEQ,T(NORDER+9,l),NDIM,T(NORDER+4,l)J^IM) 


With  the  logic  appearing  in  this  form,  the  expression  of  the  remaining  DO  LOOPS  in  terms  of  the 
BLAS  is  trivial,  since  DEN  is  a  constant  across  all  values  of  I  for  a  constant  value  of  J: 


CALL  DCOPY(NEQ,Y(l),l,T(3,l),NDIM) 

CALL  DCOPY(NEQ,DY(l),l,T(9,l),NDIM) 

IF  (NORDER  .GT.  0)  THEN 
DEN  =  ZERO 
DO  250  J  =  1,  NORDER 
DEN  =  DEN  +  H(J) 

TMP=1.0D0/DEN 

CALL  DSCAL(NEQ,ZERO,T(J+9,l),NDIM) 

CALL  DAXPY(NEQ,-TMP,T(J+3,1),NDIM,T(J+9.1),NDIM) 

CALL  DAXPY(NEQ,TMP,T(J+8,1),NDIM,T(J+9,1).NDIM) 
CALLDCOPY(NEQ,T(J+8,1),NDIM,T(Jh-3,1)JMDIM) 

250  CONTINUE 
END  IF 

CALL  DCOPY(NEQ,T(NORDER+9, 1  ),NDIM,T(NORDER+4, 1  ),NDIM) 


where  the  only  change  to  the  previous  form  is  the  introduction  of  the  temporary  TMP  to  hold  the 
reciprocal  of  DEN  at  each  step.  Remember  that  in  the  absence  of  any  data  dependency  the  order  in  which 
the  operations  are  performed  should  not  affect  the  final  results,  except  for  possible  roundoff  error.  This 
is  also  why  the  elements  of  T(J+8,I)  and/or  T(J+3,I)  were  not  scaled  and  then  copied  into  T(J+9,I).  The 
final  value  of  T(J+9,I)  would  have  been  identical  at  each  step  but  the  values  of  the  arrays  T(J+8,I)  and 
T(J+3,I),  which  are  used  elsewhere  within  the  program,  would  have  been  changed. 


Given  the  translation  of  the  DO  250/DO  300  previous  sequence,  it  can  easily  be  shown  that  the  proper 
translation  of  the  DO  550/DO  600  sequence  is  as  follows: 


CALL  DCOPY(NEQ,DY(l),l,T(9,l),NDIM) 
IF(NORDER  .GT.  0)THEN 
DEN=ZERO 
DO  550  J=l,  NORDER 
DEN=DEN+H(J) 
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TMP=ONE/DEN 

CALL  DSCAL(NEQ,ZER0,T(J+9,1)J^IM) 
CALLDAXPY(NEQ,TMP,T(J+8,1),NDIM,T(J+9,1)J^IM) 
CALLDAXPY(NEQ,-TMP,T(J+3,l).NDIM.T(J+9.1)jsrDIM) 
550  CONTINUE 
END  IF 


The  original  FORTRAN  version  of  the  DO  450/DO  500  sequence  is  as  follows. 


DO  500  I  =  1.  NEQ 
TEMP  =  0.00 

IF  (NORDER  .GT.  0)  THEN 
DO  450  J  =  1,  NORDER 
TEMP  =  TEMP  +  T(4+J,I)*AA(J) 
450  CONTINUE 
END  IF 

T(1,I)  =  TEMP  +  T(4,I)*H(1) 

Y(I)  =  T(1.I)  +  T(14.I)  +  T(3,I) 

500  CONTINUE 


This  loop  also  contains  a  dot  product,  a  multiplication  of  an  array  by  a  constant,  and  several  matrix 
additions.  There  is  also  an  IF...THEN  block  interspersed  with  the  DO  LOOPS.  As  before,  it  is  tempting 
to  move  the  array  manipulations  which  are  outside  of  the  IF...THEN  block  to  BLAS  routines  outside  of 
both  DO  LOOPS.  It  appears  at  first  that  the  reference  to  T(14,I)  in  the  line  just  before  statement  label  500 
would  prevent  this,  because  if  NORDER  is  large  enough,  this  array  would  be  referenced  within  the  DO 
LOOP;  however,  the  program  has  limited  the  value  of  NORDER  to  4,  so  that  this  wiU  never  happen.  The 
real  barrier  to  the  use  of  the  BLAS  is  that  T(1,I)  depends  upon  TEMP,  which  is  conditionally  defined  by 
the  IF...THEN  block.  The  solution  is  to  break  up  the  computation  of  T(1,I)  into  two  pieces,  one  of  which 
is  dependent  upon  the  IF...THEN  block  and  one  of  which  is  independent  of  it: 


DO  5001=  1,NEQ 
T(1.I)  =  T(4,I)*H(1) 

TEMP  =  0.00 

IF  (NORDER  .GT.  0)  THEN 
DO  450  J  =  1,  NORDER 
TEMP  =  TEMP  +  T(4+J,I)*AA(J) 
450  CONTINUE 
END  IF 

T(1,I)  =  TEMP  +  T(1,I) 

Y(I)  =  T(1,I)  +  T(14,I)  +  T(3,I) 

500  CONTINUE 
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The  computation  of  Y(I)  may  also  be  broken  into  two  pieces,  one  of  which  is  independent  of  the 
IF... THEN  block,  and  one  which  depends  implicitly  upon  the  IF... THEN  block  through  the  dependence 
upon  T(1,I): 


DO  500  I  =  1,  NEQ 
T(1,I)  =  T(4,I)*H(1) 

Y®  =  T(1.I)  +  T(14,I)  +  T(3,I) 
TEMP  =  0.00 

IF  (NORDER  .GT.  0)  THEN 
DO  450  J  =  1,  NORDER 
TEMP  =  TEMP  +  T(4+J,I)*AA(J) 
450  CONTINUE 
END  IF 

T(1,I)  =  TEMP  +  T(1,I) 

Y(I)  =  T(1,I)  +  Y(I) 

500  CONTINUE 


Once  the  separation  has  been  performed,  separation  of  the  code  into  BLAS  calls  is  performed: 


CALL  DCOPY(NEQ,T(4, 1  ),NDIM,T(  1 ,1),NDIM) 
CALL  DSCAL(NEQ,H(1),T(1,1).NDIM) 

CALL  DCOPY(NEQ,T(14,l),NDIM,Y(l),l) 
CALL  DAXPY(NEQ,ONE,T(3,l),NDIM,Y(l),l) 
DO  500  I  =  1,  NEQ 
TEMP  =  0.00 

IF  (NORDER  .GT.  0)  THEN 
DO  450  J  =  1,  NORDER 
TEMP  =  TEMP  +  T(4+J,I)*AA(J) 

450  CONTINUE 
END  IF 

T(1,I)  =  TEMP  +  T(1,I) 

500  CONTINUE 

CALL  DAXPY(NEQ,ONE,T(l,l),NDIM,Y(l),l) 


The  structure  of  the  nested  DO  LOOPs  and  the  IF.. .THEN  statement  may  be  collapsed  once  it  is 
recognized  that  the  only  array  whose  value  is  changed  is  T(1,I):  the  nested  DO  LOOPs  may  be  replaced 
by  a  single  DO  LOOP  containing  a  DDOT,  and  the  single  DO  LOOP  enclosed  within  the  IF...THEN 
statement,  completing  the  optimization: 


CALL  DCOPY(NEQ,T(4,l),NDIM,T(l,l),NDIM) 
CALL  DSCAL(NEQ,H(1),T(1,1),NDIM) 
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CALL  DC0PY(NEQ,T(14,1),NDIM.Y(1),1) 

CALL  DAXPY(NEQ,0NE.T(3,l),NDIM,Y(l).l) 

IF  (NORDER  .GT.  0)  THEN 
DO  500  I  =  1,  NEQ 

T(1,I)  =  T(1,I)  +  DDOT(NORDER,T(5,I).l,AA(l),l) 
500  CONTINUE 
END  IF 

CALL  DAXPY(NEQ,ONE,T(l,l),NDIM.Y(l),l) 


There  are  two  caveats  to  the  preceding  discussion.  Note  first  that  time  is  saved  by  calculating  the  final 
value  of  T(1,I)  first,  including  the  IF...THEN  dependence,  before  adding  the  value  of  T(1,I)  to  array  Y, 
rather  than  adding  each  piece  to  Y  as  it  is  calculated.  Second,  it  is  important  to  notice  that  the  vector 
T(4,I)  was  copied  to  T(l,l)  in  the  first  line,  and  then  T(1,I)  was  scaled  in  the  second  line.  If  the  scaling 
had  been  performed  on  T(4,l)  before  copying,  then  the  value  of  T(l,l)  would  have  remained  correct,  but 
the  value  of  T(4,I)  would  have  been  different  than  the  value  produced  by  the  original  DO  LOOP  structure, 
changing  the  values  returned  by  the  subroutine.  The  DO  LOOP  sequence  in  DO  650/DO  700  may  be 
replaced  by  BLAS  calls  similar  to  those  just  discussed: 


CALL  DC0PY(NEQ,T(9,1),ND1M,T(2,1),NDIM) 

CALL  DSCAL(NEQ,H(1),T(2,1),ND1M) 
IF(NORDER.GT.0)THEN 
DO  650  1=1, NEQ 

T(2,1)=T(2,I)-DDOT(NORDER,T(10,1),1,BB(1),1) 
650  CONTINUE 
END  IF 

CALL  DCOPY(NEQ,T(2,l),NDIM,Y(l),l) 
CALLDAXPY(NEQ,ONE,T(3,l),NDIM,Y(l),l) 

CALL  DAXPY(NEQ,ONE,T(  14, 1),NDIM,Y(1  ),1 ) 
CALLDAXPY(NEQ,0NE,T(2,1),ND1M,T(14,1),ND1M) 
CALL  DAXPY(NEQ,-ONE,Y(l),l,T(14,l)J^IM) 
CALLDAXPY(NEQ,0NE,T(3,1),ND1M.T(14.1),NDIM) 


4.3.  Compiler-Level  Optimizations.  This  section  is  concerned  with  the  compiler  optimizations  used 
on  the  different  machines.  It  is  important  first  to  realize  that  the  Crays  have  different  compiler  options 
than  the  IBM  machines:  not  only  are  there  differences  in  the  flags  used  to  signal  the  use  of  a  given 
compiler  option,  but  some  options  are  used  by  default  on  one  machine  which  must  be  called  explicitly  on 
the  other  machines,  or  which  do  not  even  exist  on  the  other  machines.  Since  an  exhaustive  study  of  all 
of  the  possible  compiler  options  is  beyond  the  scope  of  this  report,  the  following  table  lists  the  compiler 
options  which  were  investigated,  the  calling  sequence  on  each  machine,  and  the  default  value  of  that 
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compiler  option  on  each  machine.  If  a  given  compiler  option  is  nonexistent  on  a  particular  machine,  then 
the  legend  under  that  entry  will  read  "N/A"  (not  applicable).  A  brief  description  of  the  names  and  the 
functions  of  the  various  compiler  options  follows  the  table. 


Table  1.  Compiler  Options  Used  on  Each  Machine 


Machine 

Optimization 

Compiler  Options 
Calling  Sequence 

Default  Value 

Cray 

scalar  optimization 

-0  scalar 

disabled 

(X-MP/48) 

vector  optimization 

-0  vector 

enabled 

(Y-MP/8128) 

subroutine  inlining 

-0  inline 

disabled 

(C-90/16256) 

preprocessor 

or 

-I  subroutine  name 
N/A 

N/A 

scalar  optimization- 

03 

disabled 

vector  optimization 

N/A 

N/A 

subroutine  inlining 

disabled 

IBM  RS/6000 

preprocessor 

-Pk  (+options) 

disabled 

(Model  550  and 
Model  370) 

or 

-Pv  (-Hoptions) 

disabled 

*  This  option  may  be  invoked  directly,  using  the  command  "-inline_from  (subroutinejuune),”  or  called  as  an  option  from  either 
of  the  preprocessors. 


The  scalar  optimization  compiler  flag  instructs  the  compiler  to  perform  various  changes  to  the  code, 
in  the  interest  of  increased  computational  efficiency,  when  compiling  the  original  FORTRAN.  For 
example,  on  many  machines,  floating-point  divisions  are  considerably  more  costly  than  floating-point 
additions  and/or  multiplications.  (On  the  IBM  RS/6000,  for  example,  a  single  floating-point  division  can 
take  up  to  19  times  as  long  as  a  single  floating-point  addition  or  multiplication.)  One  way  to  help  reduce 
the  CPU  required  by  the  program,  therefore,  would  be  to  replace  a  floating-point  division  (say  division 
by  a  constant  within  a  DO  LOOP)  by  a  reciprocation  of  the  same  constant  before  the  DO  LOOP,  and  to 
replace  the  division  within  the  DO  LOOP  with  multiplication  by  the  reciprocal.  The  risk  associated  with 
such  a  replacement  is  that,  due  to  the  binary  representation  of  floating-point  arithmetic  on  the  computer, 
the  two  results  might  not  be  bitwise  identical.  This  may  lead  to  the  accumulation  of  roundoff  errors  as 
execution  of  the  program  continues.  In  most  cases,  however,  the  differences  in  the  final  computed  results 
are  negligible,  while  the  speedup  is  significant.  Therefore,  even  though  the  scalar  optimization  flag  is  not 
performed  by  default  on  many  machines,  use  of  the  scalar  optimizer  is  strongly  encouraged. 
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The  vector  optimization  flag  only  exists  on  the  Cray  machines,  since  only  they  have  vector  processors. 
This  optimization  helps  in  the  vectorization  of  certain  FORTRAN  constructs  such  as  a  reduction  (i.e.,  the 
summation  of  all  elements  of  a  vector  into  a  single  scalar  value).  The  deficiency  of  the  vector 
optimization  on  the  Grays  is  that  its  use  is  mutually  exclusive  from  the  scalar  optimization. 

The  subroutine  inlining  flag  allows  the  compiler  to  replace  a  call  to  a  specific  subroutine  with  an 
inline  (hence  the  name)  version  of  the  subroutine  itself.  For  subroutines  which  are  called  repeatedly,  this 
will  save  CPU  time  by  eliminating  the  overhead  associated  with  transferring  control  of  the  program  to  the 
subroutine. 

The  preprocessor  flags  are  similar  to  the  scalar  optimization  flags,  except  that  they  perform  much  more 
drastic  changes  to  the  FORTRAN  source  code.  Since  the  changes  made  to  the  source  code  are  dependent 
upon  both  the  specific  machine  used  and  upon  the  program  itself,  the  discussion  and/or  comparison  of  the 
preprocessors  available  upon  the  machines  used  here,  and  of  their  effects  upon  the  program  considered 
here,  will  not  necessarily  be  applicable  to  programs  in  general  or  to  work  stations  in  general.  (Specific 
details  of  the  options  and  the  performance  of  optimizing  preprocessors  on  the  various  machines  are  given 
in  proprietary  publications  supplied  by  the  hardware  vendors  and  will  not  be  included  here.)  Some 
illustrative  examples  of  the  changes  which  may  be  made  to  the  FORTRAN  by  optimizing  preprocessors 
include  the  following: 


•  Generation  of  calls  to  the  BLAS  routines. 

•  "Unrolling"  of  DO  LOOPS  to  improve  memory  access. 

•  Changing  of  IF  statement  to  BLOCK  IF  constructs. 

•  Loop  nest  reordering. 

For  this  work,  we  have  compared  the  performance  of  two  versions  of  this  program  on  each  of  the  IBM 
machines  to  the  performance  of  the  "original"  version  on  each  of  the  Cray  machines.  The  two  versions 
are  identical,  except  for  the  replacement  of  certain  FORTRAN  DO  LOOPs  with  BLAS  as  detailed  in 
section  4.2.  The  Cray  version  of  the  program  also  differs  slightly  from  the  version  used  on  the  work 
stations  in  two  respects.  First,  as  noted  earlier,  Cray  computers  have  64-bit  words,  and  so  the  compiler 
options  used  on  the  Grays  include  the  -dp  option  in  order  that  the  FORTRAN  source  code  used  on  the 
other  machines  will  not  require  extensive  modification  before  porting  it  to  the  Grays.  Second,  the  random 
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number  generator  used  on  the  other  machines  is  not  present  on  the  Grays,  so  a  proprietaiy  Cray  Research, 
Inc.  random  number  generator  is  used  instead.  We  compare  the  performance  of  each  version  of  the 
program  subject  to  the  following  initial  conditions: 

•  Compilation  using  only  default  optimization  (note  that  this  includes  vector  optimization  on  the 
Grays). 

•  Compilation  using  scalar  optimization. 

•  Compilation  using  scalar  optimization  in  conjunction  with  the  preprocessor  (on  the  work  stations, 
use  of  the  preprocessors  includes  inlining:  on  the  Grays,  the  inlining  must  be  invoked  separately). 

Since  the  correspondence  between  the  compiler  options  on  the  different  machines  is  not  exact,  the 
conditions  under  which  the  runs  are  compared  will  not  be  "identical"  from  machine  to  machine,  and  this 
should  be  borne  in  mind  during  the  discussion  of  the  timings.  The  timings  of  the  different  runs  and  a 
discussion  of  the  performance  are  given  in  the  next  section. 

5.  RESULTS  AND  DISCUSSION 

The  runs  performed  on  each  version  of  the  program  on  each  machine  are  as  follows. 


Machine  Name 

Version  of  Program 

Compiler  Options® 

CPU  Seconds 

Cray  X-MP/48 

original 

-Wf,”-dp  -ensciv” 

6973.8 

original 

-Wf,”-dp  -ensciv 

-0  scalar  -A  fast” 

6643.1 

original 

-Wf,”-dp  -ensciv 

-0  inline  -o  scalar 

-A  fast” 

6674.8 

Cray  Y-MP/8128 

original 

-Wf,”-dp  -ensciv” 

5781.0 

original 

-Wf.”-dp  -ensciv 

-0  scalar  -A  fast” 

5774.1 

‘  These  options  are  discussed  in  section  4.3. 
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Machine  Name 

Version  of  Program 

Compiler  Options® 

CPU  Seconds 

Cray  C-90/16256 

original 

-Wf,”-dp  -ensciv” 

3977.0 

original 

-Wf,”-dp  -ensciv 

-0  scalar  -A  fast” 

3975.1 

IBM  RS/6000 

original 

-qfips  -qlist  -Q 

15320.4 

Model  550 

original 

-03  -qfips  -qlist  -Q 

10215.5 

original  +  BLAS  1 

-03  -qfips  -qlist  -Q 

10126.7 

original 

-03  -qfips  -qlist  -Q 

-Pv,options 

10950.6 

original 

-03  -qfips  -qlist  -Q 

-Pk,options 

9878.8 

IBM  RS/6000 

original 

-qfips  -qlist  -Q 

10356.4 

Model  370 

original 

-03  -qfips  -qlist  -Q 

6714.3 

original  +  BLAS 

-03  -qfips  -qlist  -Q 

6929.5 

original 

-03  -qfips  -qlist  -Q 

-Pv,options 

6743.8 

original 

-03  -qfips  -qlist  -Q 

-Pk,  options 

6335.2 

®  These  options  are  discussed  in  section  4.3. 

The  timings  differ  both  by  machine  and  by  the  compiler  options  selected.  We  first  discuss  the 
performance  on  the  Cray  X-MP/48,  then  the  performance  on  the  IBM  work  stations.  (The  runs  on  the 
Y-MP/8128  and  C-90/16256  are  included  for  comparison.)  Finally  we  relate  the  execution  of  the  program 
to  the  architecture  of  the  machines  used  and  make  recommendations  for  optimization  of  more  general 
FORTRAN  codes. 

For  the  Cray  X-MP/48,  it  is  noteworthy  that  the  best  performance  does  not  result  from  the  use  of  the 
default  compiler  (which  emphasizes  optimization  of  vector  constructs),  but  rather  from  the  use  of  scalar 
optimization;  this  will  be  discussed  in  more  detail  later.  In  addition,  the  use  of  the  inlining  option  on  the 
Cray  X-MP/48  does  not  significantly  increase  the  execution  rate  of  the  program.  (To  ensure  that  the 
subroutines  chosen  for  inlining  were  actually  those  which  would  be  likely  to  improve  the  program's 
performance,  the  program  was  first  mn  on  the  Cray  X-MP/48  with  the  -flowtrace  option,  in  order  to  find 
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which  subroutines  were  the  best  candidates  for  inlining.  Neither  increasing  the  depth  of  subroutines  which 
were  inlined,  nor  explicitly  naming  the  most  favorable  candidates,  significantly  increased  the  execution 
rate  of  the  program.) 

The  execution  rate  of  this  program  was  actually  enhanced  by  the  use  of  the  scalar  optimization  on  the 
Cray  X-MP/48  as  opposed  to  the  default  vector  optimization.  This  is  contrary  to  the  "usual"  behavior  of 
programs  on  a  Cray,  where  vectorization  of  FORTRAN  structures  is  recommended  to  achieve  optimum 
performance.  Programs  that  achieve  optimum  performance  on  a  Cray  usually  contain  either  well-defined 
numerical  subroutine  calls  (matrix  multiply,  solution  of  linear  equations,  etc.),  which  are  both  well 
vectorized  (for  example  the  LINPACK  and  EISPACK  families  of  linear  algebra  subroutines  present  in  the 
Cray  scientific  library  routines)  and  numerically  intensive.  Extensive  use  of  these  routines  within  a 
program  virtually  guarantees  that  the  majority  of  the  computational  work  within  the  program  will  execute 
optimally  on  a  Cray.  Also,  even  if  the  application  being  run  does  not  explicitly  contain  calls  to  vectorized 
scientific  subroutine  libraries,  many  scientific  application  programs  are  constructed  such  that  the  majority 
of  the  numerical  work  in  the  program  is  performed  in  DO  LOOPS  of  the  type  described  in  section  4.2. 

The  ratio  of  CPU  times  for  the  two  IBM  RISC  machines  are  approximately  proportional  to  their 
respective  clock  speeds,  as  expected.  The  performance  of  different  versions  of  the  program  increases  in 
the  order  Raw  code  <  Scalar-optimized  raw  code  <  Scalar-optimized  raw  code  -i-BLAS  <  Scalar-optimized 
raw  code  +  Preprocessor. 

It  is  obvious  that  a  significant  improvement  results  from  the  inclusion  of  any  optimization  on  the  IBM 
machines.  It  has  been  our  experience  that  significant  increases  in  throughput  on  scalar  work  stations  are 
readily  attainable  by  minor  coding  changes  and/or  use  of  compiler  flags.  For  the  program  considered  here, 
a  50%  improvement  in  performance  was  obtained  by  using  simple  compiler  optimizations.  This 
improvement  made  it  possible  for  a  RISC  work  station  (the  IBM  370)  to  outperform  the  Cray  X-MP 
supercomputer  on  a  prototype  production  program.  (In  previous  work  (Unekis  et  al.,  1994),  Unekis  has 
improved  the  CPU  performance  of  another  computational  chemistry  code  by  a  factor  of  5  on  an 
IBM  RS/60(X)  RISC  workstation  using  the  methods  discussed  here). 

Even  though  the  use  of  the  compiler  flags  gave  markedly  superior  results  to  the  use  of  the  BLAS  for 
this  program  on  these  work  stations,  in  general  it  is  better  if  the  user  makes  comparisons  on  a  case-by-case 
basis  rather  than  relying  on  compiler  optimizations  alone.  There  are  two  main  reasons  for  this.  The  first. 
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of  course,  is  portability.  Even  though  the  IBM  preprocessors  and  compiler  optimizations  worked  very 
efficiently  for  this  program,  the  IBM  compilers  are  only  available  on  IBM  machines  (and  the  quality  of 
the  compilers  and  preprocessors  may  vary  from  vendor  to  vendor),  whereas  the  BLAS  are  portable  across 
a  wide  variety  of  platforms  and  have  been  optimized  separately  by  each  vendor  for  their  particular 
hardware.  The  second  is  that  the  use  of  the  BLAS  encourages  the  programmer  to  think  both  about  the 
efficiency  of  the  program  and  about  other  potential  users  of  the  program.  Overreliance  on  the  compiler 
may  lead  to  lazy  programming  habits,  and  in  practice  to  inefficient  usage  of  the  machine  if  one  set  of 
compiler  options  is  inappropriate  for  a  given  program.  On  the  other  hand,  overreliance  on  hand-tuning 
of  a  program  may  lead  to  nonportable  FORTRAN,  or  to  a  program  which  is  incomprehensible  to  other 
users.  The  use  of  the  BLAS  is  a  middle  ground  between  these  two  extremes. 

One  of  the  important  lessons  to  be  learned  from  these  benchmaiking  studies  is  the  interplay  of  the 
FORTRAN  and  the  machine  architecture  in  code  optimization.  For  the  molecular  dynamics  program 
considered  here,  the  majority  of  the  executable  statements  in  the  program  are  concerned  with  the 
calculation  of  the  contribution  to  the  interaction  potential  by  each  distinguishable  pair  of  particles  in  the 
simulation,  and  the  resulting  derivatives  (forces).  Even  though  the  formal  evaluation  of  the  potential  (see 
equations  1-6)  appears  to  contain  many  vectorizable  statements,  both  the  segmented  representation  of  the 
interaction  potential  and  the  nonunifonm  stride  through  memory  inhibit  vectorization  of  this  portion  of  the 
program.  This  means  that  the  majority  of  the  executable  statements  within  the  program  are  processed  in 
scalar  mode,  for  which  the  IBM  work  stations  are  approximately  as  fast  as  the  Cray  vector  machines.  The 
difference  in  the  overall  performance  of  the  program  on  the  workstations  as  opposed  to  the  Grays  is 
dependent  upon  the  remainder  of  the  program  (primarily  within  the  numerical  integrator)  which  consists 
of  DO  LOOPS  as  described  in  section  4.2.  These  statements  are  ordinarily  performed  optimally  on  the 
Cray  architecture  by  vectorization,  but  are  bottlenecks  on  the  scalar  processors  due  to  the  necessity  of 
accessing  the  proper  array  elements  from  the  memory.  The  high  proportion  of  scalar  statements  within 
the  program,  therefore,  makes  the  use  of  a  vector  processor  relatively  inefficient:  since  the  program 
executes  primarily  in  scalar  mode,  the  use  of  alternative  computer  architectures,  such  as  high-end  scalar 
work  stations,  may  become  competitive. 

Further  work  in  this  area  will  include  the  modification  of  the  program  to  run  on  a  massively  parallel 
computer,  and  a  discussion  of  the  changes  made  to  the  algorithm  and/or  the  program  required  by  the  new 
computer  platform. 
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6.  SUMMARY  AND  CONCLUSIONS 


The  findings  of  this  set  of  benchmarking  studies  are  summarized  as  follows. 

•  The  program  consists  primarily  of  nonvectorizable  (scalar)  code,  with  a  significant  minority  of 
vectorizable  statements.  These  vectorizable  statements  execute  well  on  the  Cray  machines  but  become 
bottlenecks  for  the  IBM  woik  stations. 

•  The  performance  of  the  IBM  RS/6000  woric  stations  can  be  significantly  improved  by  making 
simple  changes  to  the  program  or  by  using  compiler  optimization  directives. 

•  When  these  changes  are  made,  the  IBM  RS/6000  workstations  become  competitive  with  the  Cray 
supercomputers  and  can  even  outperform  the  Cray  XMP/48. 
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